Techniques for fixing configuration and for fixing code using contextually enriched alerts

ABSTRACT

Systems and methods for automating alert remediation. A method includes extracting entity-identifying values from cybersecurity event data included in alerts generated for a software infrastructure. Queries are generated based on the entity-identifying values. An entity graph is queried. The entity graph has nodes representing respective entities. The entities include software components of the software infrastructure and event logic components of cybersecurity event logic deployed with respect to the software infrastructure. Paths are identified in the entity graph based on the query results. Each identified path is between one of the software components and one of the event logic components. One or more root cause entities are identified based on the identified paths. A fix action plan is generated for the alerts based on the root cause entities.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of:

1) U.S. patent application Ser. No. 17/507,180 filed on Oct. 21, 2021, now pending;

2) U.S. patent application Ser. No. 17/815,289 filed on Jul. 27, 2022, now pending; and

3) U.S. patent application Ser. No. 17/816,161 filed on Jul. 29, 2022, now pending. The Ser. No. 17/816,161 application is a continuation-in-part of U.S. patent application Ser. No. 17/656,914 filed on Mar. 29, 2022, now pending.

The contents of the above-referenced applications are hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates generally to cybersecurity for computing environments, and more specifically to various techniques for securing computing environments using enriched alerts.

BACKGROUND

Infrastructure as code (IaC) is a technique for managing computing infrastructure through a high-level descriptive model. IaC allows for automating the provisioning of information technology (IT) infrastructure without requiring developers or infrastructure engineers to manually provision and manage servers, operating systems, database connections, storages, and other infrastructure elements when developing, testing, and deploying software applications. The goal of IaC is generally to provision cloud resources from code.

In IaC, an infrastructure may include various computing infrastructure resources such as network adapters, applications, containers, and the like, each of which can be implemented as code. An IaC file is interpreted or executed in order to provision these computing infrastructure resources within a cloud environment.

A key aspect of managing infrastructure is securing the infrastructure against potential cyber threats. To this end, most virtualized execution environments deploy several cybersecurity detection tools to monitor for abnormalities in different parts of the software development pipeline such as code, container repositories, production containers, and the like. These tools may generate alerts when abnormal or otherwise potentially vulnerable code or configuration is detected. In many implementations, the different tools scan for alerts in different parts of the pipeline. An alert is a collection of events that, taken together, are significant from a cybersecurity perspective. Each alert may be realized as or may include text indicating the type of potential risk, the events involved, relevant times, and the like.

Securing the infrastructure against potential cyber threats therefore requires identifying improper configurations of infrastructure resources and/or improperly written code for those resources. Once such issues with configuration and/or code are identified, appropriate steps may be taken to address the issues. To adequately address these issues, relevant components within the infrastructure must be identified so that remedial actions can be directed to those relevant components. However, existing solutions typically rely on subjective judgments by human observers who are familiar with the design of the infrastructure or on documentation prepared by architects of the infrastructure. Each of these kinds of existing solutions relies on subjective decisions made by human observers, which can lead to inconsistencies and result in failing to properly identify the relevant entities for purposes of addressing issues in the infrastructure. Further, these solutions rely on subjective judgments about how to address issues based on perceived relationships between infrastructure components.

In various IaC techniques, a set of interrelated modules containing resource definitions that collectively represent a desired state of a computing environment is maintained. Each module may include a set of source files. As a specific example, Terraform, a common IaC language and framework, uses Terraform applications which are initiated by execution of the “terraform apply” command with respect to a certain Terraform module. This is also referred to as applying that module. Terraform applications often take place as part of an organization's code-to-cloud pipeline, typically in an automatic and periodic manner. When a Terraform application creates or otherwise manages a cloud resource, it records an association between a language-specific identifier (e.g., a Terraform identifier) of the resource and a globally unique identifier (GUID) of the resource in a configuration mapping file such as a state file.
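As a non-limiting illustration of the association recorded in such a configuration mapping file, the following minimal sketch reads a small JSON document shaped like a Terraform state file and builds a mapping from language-specific identifiers to resource identifiers. The sample contents, the function name address_to_id, and the use of an "id" attribute as the resource identifier are assumptions made for illustration only; actual state files contain many more fields, and not every resource type exposes an "id" attribute.

```python
import json

# Hypothetical, heavily trimmed state document used only for illustration.
SAMPLE_STATE = json.loads("""
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_s3_bucket", "name": "logs",
     "instances": [{"attributes": {"id": "example-logs-bucket"}}]},
    {"mode": "data", "type": "aws_region", "name": "current",
     "instances": [{"attributes": {"id": "us-east-1"}}]}
  ]
}
""")


def address_to_id(state):
    """Map each managed resource address to the identifier stored for it."""
    mapping = {}
    for resource in state.get("resources", []):
        # Data sources are read-only lookups, so only managed resources are kept.
        if resource.get("mode") != "managed":
            continue
        address = f'{resource["type"]}.{resource["name"]}'
        if "module" in resource:
            # Resources declared inside a module carry the module path as a prefix.
            address = f'{resource["module"]}.{address}'
        for instance in resource.get("instances", []):
            resource_id = instance.get("attributes", {}).get("id")
            if resource_id is not None:
                mapping[address] = resource_id
    return mapping
```

Calling address_to_id(SAMPLE_STATE) on the sample above yields a single entry for the managed bucket and omits the data source.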

Each Terraform module defines a set of Terraform resources. Terraform modules may depend on each other in order to incorporate Terraform resource definitions from other Terraform modules. For some IaC languages like Terraform, a unique identifier such as a GUID is not maintained for each cloud resource (e.g., each Terraform resource) which may be utilized by modules applied using the respective IaC techniques and code.

A first Terraform module M may depend on a second Terraform module N, and the second Terraform module N may in turn depend on a third Terraform module T. In such a case, it can be said that T is a transitive dependency of M. In other words, it can be said that M indirectly depends on T (i.e., through its dependency on N which in turn depends on T). When Terraform module M is applied, the Terraform code is tasked with synchronizing the state of all resources defined by module M along with all of the resources defined by the modules N and T on which module M depends. This synchronization may include creating, deleting, and modifying resources.
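The relationship among modules M, N, and T described above can be sketched as a reachability computation over a dependency map. The map DEPENDS_ON and the function name transitive_dependencies are hypothetical and are not part of Terraform; the sketch only illustrates which modules' resources would be in scope when a given module is applied.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: module name -> modules it depends on directly.
DEPENDS_ON: Dict[str, List[str]] = {"M": ["N"], "N": ["T"], "T": []}


def transitive_dependencies(module: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every module reachable from `module` through dependency edges."""
    seen: Set[str] = set()
    stack = list(depends_on.get(module, []))
    while stack:
        current = stack.pop()
        if current in seen:
            # Already handled; this also guards against dependency cycles.
            continue
        seen.add(current)
        stack.extend(depends_on.get(current, []))
    return seen
```

For the example map, the transitive dependencies of M are N and T, while T has none.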

A root module is a module which is applied directly, i.e., not only as a dependency of another module. Organizations often maintain several root modules as well as many non-root modules, where each non-root module only acts as a dependency for other modules and is not deployed directly.

In other IaC implementations, information identifying cloud-based resources may be stored differently. For example, in Azure Resource Manager (ARM) implementations, a cloud-based resource identifier may be stored directly in a source file rather than using a configuration mapping file to maintain associations between GUIDs and cloud-based resource identifiers as might be performed for Terraform.

Techniques that improve automated alerting and remediation are highly desirable for protecting computing infrastructure against cyber threats.

SUMMARY

A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later. For convenience, the term “some embodiments” or “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.

Certain embodiments disclosed herein include a method for automating alert remediation. The method comprises: extracting a plurality of entity-identifying values from cybersecurity event data included in a plurality of alerts generated for a software infrastructure; generating at least one query based on the plurality of entity-identifying values; querying an entity graph using the at least one query, wherein the entity graph has a plurality of nodes representing respective entities of the plurality of entities, wherein the plurality of entities includes a plurality of software components of the software infrastructure and a plurality of event logic components of cybersecurity event logic deployed with respect to the software infrastructure; identifying at least one path in the entity graph based on the results of the at least one query, wherein each identified path is between one of the plurality of software components and one of the plurality of event logic components; identifying at least one root cause entity based on the identified at least one path; and generating a fix action plan for the plurality of alerts based on the identified at least one root cause entity.

Certain embodiments disclosed herein also include a non-transitory computer readable medium having stored thereon instructions for causing a processing circuitry to execute a process, the process comprising: extracting a plurality of entity-identifying values from cybersecurity event data included in a plurality of alerts generated for a software infrastructure; generating at least one query based on the plurality of entity-identifying values; querying an entity graph using the at least one query, wherein the entity graph has a plurality of nodes representing respective entities of the plurality of entities, wherein the plurality of entities includes a plurality of software components of the software infrastructure and a plurality of event logic components of cybersecurity event logic deployed with respect to the software infrastructure; identifying at least one path in the entity graph based on the results of the at least one query, wherein each identified path is between one of the plurality of software components and one of the plurality of event logic components; identifying at least one root cause entity based on the identified at least one path; and generating a fix action plan for the plurality of alerts based on the identified at least one root cause entity.

Certain embodiments disclosed herein also include a system for automating alert remediation. The system comprises: a processing circuitry; and a memory, the memory containing instructions that, when executed by the processing circuitry, configure the system to: extract a plurality of entity-identifying values from cybersecurity event data included in a plurality of alerts generated for a software infrastructure; generate at least one query based on the plurality of entity-identifying values; query an entity graph using the at least one query, wherein the entity graph has a plurality of nodes representing respective entities of the plurality of entities, wherein the plurality of entities includes a plurality of software components of the software infrastructure and a plurality of event logic components of cybersecurity event logic deployed with respect to the software infrastructure; identify at least one path in the entity graph based on the results of the at least one query, wherein each identified path is between one of the plurality of software components and one of the plurality of event logic components; identify at least one root cause entity based on the identified at least one path; and generate a fix action plan for the plurality of alerts based on the identified at least one root cause entity.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 is a schematic diagram of logical components of a pipeline manager utilized to describe various disclosed embodiments.

FIG. 2 is a flow diagram illustrating example phases for computing infrastructure security operations.

FIG. 3 is a flowchart illustrating a method for automated alert processing and fixing according to an embodiment.

FIG. 4 is a flowchart illustrating a method for creating a knowledge base according to an embodiment.

FIG. 5 is a flowchart illustrating a method for normalizing resource definitions according to an embodiment.

FIG. 6 is a flowchart illustrating a method for mapping a computing infrastructure pipeline according to an embodiment.

FIG. 7 is a flowchart illustrating a method for deduplicating and prioritizing alerts according to an embodiment.

FIG. 8 is a flowchart illustrating a method for generating a fix action plan according to an embodiment.

FIG. 9 is a schematic diagram of a hardware layer of a pipeline manager according to an embodiment.

DETAILED DESCRIPTION

FIG. 1 shows an example schematic diagram of logical components of a pipeline manager 100 utilized to describe various disclosed embodiments. In FIG. 1, the logical components include a query engine 110, an entity graph database (DB) 120, a real-time engine 130, a state manager 140, a relations manager 150, one or more enrichers 160, and a resource finder 170.

The query engine 110 is configured to generate or otherwise receive queries (not shown) for execution with respect to an entity graph stored in the graph DB 120. Specifically, such queries may be performed, for example, in order to map steps of pipeline execution (e.g., as described below with respect to FIG. 6), to identify correlations between components indicated in alerts (e.g., as described below with respect to FIG. 7), to identify paths in the entity graph (e.g., as described below with respect to FIG. 8), and the like. The query engine 110 is further configured to query the graph DB 120, either directly or via the real-time engine 130, in order to retrieve information related to mappings of entities within the graph DB 120.

The graph DB 120 at least includes an entity graph (not separately depicted). The entity graph maps entities among a computing infrastructure and may be created, for example, as discussed below with respect to FIG. 4. The entity graph maps the software infrastructure including connections among software components acting as entities of the entity graph. To this end, the entity graph includes nodes and edges. The nodes represent distinct logical entities such as, but not limited to, software components, event logic components, and the like. The edges connect nodes based on correlations between their respective entities (e.g., correlations derived as discussed below with respect to S410).
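As a non-limiting illustration, an entity graph of this kind could be held in memory as a set of typed nodes with an adjacency map for the edges. The class name EntityGraph, its methods, the kind labels, and the node identifiers below are hypothetical and chosen only for this sketch; the disclosure does not prescribe a storage layout for the graph DB 120.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class EntityGraph:
    # Node identifier -> kind of entity (labels here are illustrative).
    kinds: Dict[str, str] = field(default_factory=dict)
    # Undirected adjacency: node identifier -> identifiers of correlated nodes.
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_entity(self, node_id: str, kind: str) -> None:
        """Register a node for a software component or an event logic component."""
        self.kinds[node_id] = kind
        self.edges.setdefault(node_id, set())

    def correlate(self, a: str, b: str) -> None:
        """Add an edge recording a correlation between two existing entities."""
        if a not in self.kinds or b not in self.kinds:
            raise KeyError("both entities must exist before they are linked")
        self.edges[a].add(b)
        self.edges[b].add(a)


graph = EntityGraph()
graph.add_entity("repo:payments", "software_component")
graph.add_entity("policy:public-bucket", "event_logic")
graph.correlate("repo:payments", "policy:public-bucket")
```

Rejecting edges to unknown nodes at insertion time is one possible design choice; it keeps the adjacency map free of dangling references from the start.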

The entity graph may further include entity-identifying values representing specific entities such as, but not limited to, resource name, unique identifier, and the like. The entity graph may also include nodes representing code owners. A code owner to which a notification should be sent may be a person, team, business unit, and the like, and may be represented by a node linked to the root cause entity in the entity graph.

The entity graph provides an end-to-end view of all domains of the software infrastructure including connections between components of those domains, thereby establishing potential connections between any two given components in the software infrastructure and their respective domains. In accordance with various disclosed embodiments, the entity graph includes schematic data linking different domains and demonstrating linkages within each domain. The domains include domains representing various layers of the software infrastructure as well as domains representing event logic components (e.g., policies, code defining business logic, queries, etc.) related to cybersecurity events. The event logic components represented in the entity graph may include, but are not limited to, metadata, definitions, policies or other components linked to a cybersecurity event (e.g., code defining detection logic used to detect cybersecurity events, queries which resulted in alerts triggering, etc.).

By graphing domains including both portions of the software infrastructure and event logic components related to cybersecurity events which may be triggered with respect to the software infrastructure, the entity graph can be queried in order to determine paths of nodes connecting entities to event logic components, thereby establishing the root cause of any given cybersecurity event as the entity connected to the event logic components related to the cybersecurity event.
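One way to picture such a path query is a breadth-first search between an event logic component and a software component. The sketch below is a simplification under stated assumptions: the adjacency map EDGES, the node names, and the function find_path are hypothetical, and breadth-first search is used here only as a familiar stand-in for whatever query mechanism a given graph database provides.

```python
from collections import deque

# Hypothetical adjacency map linking an event logic component to software components.
EDGES = {
    "policy:public-bucket": ["resource:bucket-1"],
    "resource:bucket-1": ["policy:public-bucket", "iac:storage.tf"],
    "iac:storage.tf": ["resource:bucket-1", "repo:infra"],
    "repo:infra": ["iac:storage.tf"],
}


def find_path(edges, start, goal):
    """Return a shortest list of nodes from start to goal, or None if unconnected."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in edges.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

In this example, find_path(EDGES, "policy:public-bucket", "repo:infra") passes through the flagged resource and the IaC file that defines it before reaching the repository.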

The real-time engine 130 may be utilized to split processing of queries depending on need. For example, when query information is needed in real-time (e.g., to remediate an ongoing alert or alerts), the query may be provided to the real-time engine 130 from the query engine 110 and executed upon the graph DB 120 in real-time. When real-time processing is not required, queries may be batched or otherwise held and then processed later. Bifurcating queries in this manner allows for optimizing query processing speed for the real-time queries.
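The bifurcation described above can be sketched as a small router that runs real-time queries immediately and holds the rest for a later batch. The class name QueryRouter, its methods, and the execute callable are hypothetical; how queries are actually executed against the graph DB 120 is outside the scope of this sketch.

```python
from collections import deque


class QueryRouter:
    """Runs real-time queries immediately and defers all other queries."""

    def __init__(self, execute):
        # `execute` is any callable that runs one query and returns its result.
        self._execute = execute
        self._pending = deque()

    def submit(self, query, realtime=False):
        if realtime:
            return self._execute(query)
        # Not urgent: hold the query until the next batch run.
        self._pending.append(query)
        return None

    def flush(self):
        """Execute all held queries in submission order and return their results."""
        results = []
        while self._pending:
            results.append(self._execute(self._pending.popleft()))
        return results
```

A caller handling an ongoing alert would pass realtime=True, while background tasks would submit normally and rely on a periodic call to flush.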

The entities and mapping in the graph DB 120 may be managed by the state manager 140. The state manager 140 may monitor for changes or inconsistencies in the entity graph in order to determine when and how to modify the entity graph to ensure that the entity graph accurately reflects the underlying software infrastructure. To this end, the state manager 140 may include a state database 145 which stores the entities and relationships discovered during analysis of the software infrastructure. The state manager 140 acts as a source of truth for the entity-related data and may intake identifications of resources which may be entities within the entity graph. The state database 145 may further store other information related to processing data related to entities such as, but not limited to, a semantic concepts dictionary used for semantically analyzing properties in original definitions of resources.

To aid in managing the state of entities represented within the entity graph, the state manager 140 may communicate with the relations manager 150, one or more enrichers 160, or both. The relations manager 150 may be configured to monitor for broken links within the entity graph, i.e., relationships (e.g., represented by edges) which include connections to entities (e.g., represented by nodes) which are no longer reflected in the entity graph. The relations manager 150 may further identify missing integrations, and may optionally send notifications to an admin of the software infrastructure in order to restore missing integrations. Thus, the relations manager 150 allows for maintaining relationships between resources represented in the entity graph as well as for removing inactive relationships from the entity graph in order to ensure that the entity graph continues to accurately reflect the connections between entities.
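Detecting broken links in the sense described above amounts to finding edges with at least one endpoint that is no longer present among the nodes. The following sketch assumes edges are stored as pairs of node identifiers; the function names find_broken_links and prune_broken_links and the sample data are hypothetical.

```python
def find_broken_links(nodes, edges):
    """Return edges with at least one endpoint missing from `nodes`."""
    return [(a, b) for a, b in edges if a not in nodes or b not in nodes]


def prune_broken_links(nodes, edges):
    """Return the edges that remain once broken links are removed."""
    broken = set(find_broken_links(nodes, edges))
    return [edge for edge in edges if edge not in broken]


# Hypothetical graph state in which "image:old" has been removed from the nodes.
NODES = {"repo:web", "image:web"}
LINKS = [("repo:web", "image:web"), ("repo:web", "image:old")]
```

Here find_broken_links(NODES, LINKS) reports only the edge pointing at the removed image, and pruning leaves the remaining edge untouched.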

The enrichers 160 may be configured to enrich data related to resources and, in particular, to enrich batches of data related to multiple resources. To this end, the enrichers 160 may be configured to generate insights related to communications between resources or otherwise related to potential relationships between resources.

The state manager 140 may further be configured to communicate with a resource finder 170. The resource finder 170 is configured to process data related to potential entities of a software infrastructure in order to identify such resources and to obtain data which may be stored or otherwise used by the state manager 140. To this end, the resource finder 170 may include one or more fetchers 171 and one or more parsers 172. The resource finder 170 may further include additional components (not shown) to handle processing of data such as, but not limited to, a fetch queue manager, a fetch scheduler, a fetch normalizer, a web hook receiver, combinations thereof, and the like.

The fetchers 171 are configured to perform one or more functions for fetching data from data sources within a software infrastructure. Such functions may include, but are not limited to, accessing relevant application programming interfaces (APIs), retrieving and storing raw data, passing data to the parsers 172, and utilizing software development kits (SDKs) in order to access data in the sources. In some implementations, each source has a respective fetcher 171.

The parsers 172 are configured to perform one or more functions for parsing data fetched from the data sources within the software infrastructure. Such functions may include, but are not limited to, handling fetched data, accessing raw data in buckets, extracting resources and relations between resources, and publishing new fetch missions. The outputs of the parsers 172 may include, but are not limited to, identifications of resources and connections between resources to be utilized by the state manager 140.

It should be noted that various logical components are depicted in FIG. 1 merely for example purposes and without limitation on the disclosed embodiments. Additional components which are not depicted may be incorporated into the pipeline manager 100 without departing from the scope of the disclosure. Further, the fetchers 171 and the parsers 172 may be integrated, for example, as a single component configured to perform both fetching and parsing, without departing from the scope of the disclosed embodiments.

FIG. 2 is a flow diagram 200 illustrating example phases for computing infrastructure security operations. The flow diagram 200 illustrates a discovery phase 210, a reduction phase 220, and a fixing phase 230. Each of these phases is discussed in further detail below with respect to the following flowcharts.

As depicted in FIG. 2, the discovery phase 210 includes mapping a computing infrastructure pipeline and analyzing potential origins of risks. The result is a knowledge base which can be subsequently queried in order to derive information about potential root causes of alerts. Further, the discovery phase 210 may include identifying owners of code or other portions of the software infrastructure (i.e., engineers or computer scientists who wrote the code behind those portions) and mapping those owners in order to aid in attribution for purposes of, e.g., root cause analysis and fixing.

The reduction phase 220 includes reducing alerts via deduplication and prioritization, as well as identifying root causes and owners of root cause components. The result is a reduced set of data related to alerts, which may be enriched with data indicating the root causes and which can therefore be processed and utilized for implementing fixes more efficiently than the unreduced data.

The fixing phase 230 includes generating fix actions or otherwise proposing fixes. To this end, the fixing phase 230 further includes generating a fix action plan. The result of the fixing phase 230 may include, but is not limited to, a notification indicating the fix action plan, sets of computer-executable instructions for executing at least a portion of the fix action plan, both, and the like.

FIG. 3 is a flowchart 300 illustrating a method for automated alert processing and fixing according to an embodiment. In an embodiment, the method is performed by the pipeline manager 100, FIG. 1.

During a discovery phase 301, resources are identified and mapped in a knowledge base. Specifically, at S310, a knowledge base is created. To this end, S310 may include, but is not limited to, deriving correlations between software components by analyzing software development lifecycle (SDLC) pipeline data and log data, and creating an entity graph mapping such software components with respect to the derived correlations. Additionally, S310 may include creating or otherwise incorporating into the knowledge base a semantic concepts dictionary defining potential characteristics of entities which may be represented in the entity graph. An example method for creating a knowledge base is described further below with respect to FIG. 4.

At S320, the pipeline is mapped so as to demonstrate potential origins of risks. To this end, S320 may include, but is not limited to, enumerating steps of pipeline execution and mapping the enumerated steps with respect to components of the software development infrastructure, in particular, components represented as entities within the entity graph of the knowledge base. Moreover, the enumeration may be a recursive enumeration that begins at a top-level service identifier. The steps may further be classified to increase granularity of the mapping, thereby improving any cybersecurity decisions determined utilizing the mapping. An example method for mapping a pipeline is described further below with respect to FIG. 6.
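A recursive enumeration beginning at a top-level service identifier can be sketched as a depth-first walk over a map of steps to sub-steps. The map PIPELINE, the step names, and the function enumerate_steps are hypothetical; the depth recorded for each step is one simple way to retain the structure that later classification could draw on.

```python
# Hypothetical pipeline layout: identifier -> sub-steps executed on its behalf.
PIPELINE = {
    "service:checkout": ["step:build", "step:deploy"],
    "step:build": ["step:compile", "step:unit-test"],
    "step:deploy": ["step:push-image"],
}


def enumerate_steps(node, children, depth=0, seen=None):
    """Recursively list (depth, identifier) pairs starting at `node`."""
    seen = set() if seen is None else seen
    if node in seen:
        # A step reachable twice is listed once, which also prevents infinite recursion.
        return []
    seen.add(node)
    listing = [(depth, node)]
    for child in children.get(node, []):
        listing.extend(enumerate_steps(child, children, depth + 1, seen))
    return listing
```

Starting from "service:checkout", the walk lists the service first, then each build step, then each deploy step, with nested steps at greater depth.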

In some embodiments, the mapping further includes nodes representing owners of code or portions of code. Specifically, such owners may be, but are not limited to, engineers, computer scientists, or other entities who wrote the code or portions thereof for software components among the software infrastructure. In other words, the owner of a piece of code is an entity who wrote the code and, consequently, may be an appropriate person to help fix any problems with the code if fully automated remediation is not being performed. Mapping code owners within the entity graph and, in particular, mapping code owners to their respective portions of code, allows for attributing root causes with respect to such portions of code to their owners, which in turn allows for automatically identifying appropriate code writers to notify of potential problems within the software infrastructure requiring fixing.

During a reduction phase 302, alerts are reduced via deduplication and prioritization as well as by identifying root causes and the entities which own the root causes. The result is a reduced set of information which can be processed more efficiently. To this end, at S330, alerts are obtained. The alerts may be received from one or more cybersecurity tools such as, but not limited to, scanners. Alternatively, the alerts may be retrieved from a repository storing alerts from such cybersecurity tools. The alerts at least indicate cybersecurity events related to components in the software infrastructure.

At S340, alerts are deduplicated and prioritized. Specifically, the entity graph is queried in order to determine correlations between components involved in the alerts across different portions of the software development pipeline. Alerts are matched based on the query results in order to identify duplicate alerts and then deduplicated by removing duplicate instances of alerts. The alerts may be prioritized using one or more prioritization rules that provide a ranking or scoring scheme for ranking alerts, and alerts with higher rankings are prioritized over alerts with lower rankings. More specifically, the ranking or scoring may be based on types of components involved in events, the data attached to the event, the locations of components involved in an event within the software infrastructure (e.g., within a particular portion of the software development pipeline), types of connections (e.g., based on classifications of steps), combinations thereof, and the like. An example method for reducing alerts including deduplication and prioritization is described further below with respect to FIG. 7.
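The following minimal sketch illustrates deduplication and prioritization under simplifying assumptions. The mapping CANONICAL stands in for the results of querying the entity graph (it maps each alerted component to a correlated entity), and STAGE_WEIGHT is a hypothetical prioritization rule based only on pipeline location. Keeping the highest-weighted instance among duplicates is an illustrative choice, not a requirement of the disclosed embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alert_id: str
    component: str
    risk_type: str
    stage: str


# Hypothetical query results: alerted component -> correlated entity in the graph.
CANONICAL = {
    "image:web@sha1": "repo:web",
    "src/web/Dockerfile": "repo:web",
    "bucket-1": "iac:storage.tf",
}
# Hypothetical scoring rule: later pipeline stages rank higher.
STAGE_WEIGHT = {"production": 3, "registry": 2, "code": 1}


def deduplicate(alerts, canonical):
    """Keep one alert per (correlated entity, risk type) pair."""
    kept = {}
    for alert in alerts:
        key = (canonical.get(alert.component, alert.component), alert.risk_type)
        best = kept.get(key)
        if best is None or STAGE_WEIGHT.get(alert.stage, 0) > STAGE_WEIGHT.get(best.stage, 0):
            kept[key] = alert
    return list(kept.values())


def prioritize(alerts):
    """Order alerts from highest to lowest stage weight."""
    return sorted(alerts, key=lambda a: STAGE_WEIGHT.get(a.stage, 0), reverse=True)


ALERTS = [
    Alert("a1", "image:web@sha1", "cve", "registry"),
    Alert("a2", "src/web/Dockerfile", "cve", "code"),
    Alert("a3", "bucket-1", "public-access", "production"),
]
```

In this example, alerts a1 and a2 resolve to the same repository and risk type, so only a1 is kept; after prioritization the production alert a3 is ranked ahead of a1.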

At S350, root causes and owners are identified. In an embodiment, S350 includes traversing the entity graph beginning at components involved in alerts and continuing through identification of connected components (either connected directly by edges in the entity graph or indirectly through connections to other nodes). The traversal may be performed until one or more traversal end points are reached. Such end points may include, but are not limited to, terminal components (i.e., components represented by nodes which have no connections in the entity graph which have not already been traversed during this iteration), components located in certain portions of the software infrastructure, both, and the like.
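A traversal of this kind can be sketched as a depth-first walk that stops at terminal components or at components matching a caller-supplied end point rule. The adjacency map GRAPH, the node names, and the function find_end_points are hypothetical; in particular, the is_end_point predicate is only an illustrative stand-in for a rule about components located in certain portions of the software infrastructure.

```python
# Hypothetical adjacency map from a running container back to its source repository.
GRAPH = {
    "container:web-1": ["image:web"],
    "image:web": ["container:web-1", "dockerfile:web"],
    "dockerfile:web": ["image:web", "repo:web"],
    "repo:web": ["dockerfile:web"],
}


def find_end_points(edges, start_nodes, is_end_point=lambda node: False):
    """Walk the graph from the alerted components and collect traversal end points."""
    visited = set()
    end_points = []
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if is_end_point(node):
            # Configured end point: record it and do not traverse beyond it.
            end_points.append(node)
            continue
        unvisited = [n for n in edges.get(node, []) if n not in visited]
        if not unvisited:
            # Terminal component: every connection was already traversed.
            end_points.append(node)
        stack.extend(unvisited)
    return end_points
```

Starting from the alerted container, the walk ends at the repository; supplying a predicate that matches Dockerfile nodes ends the walk one step earlier.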

During a fixing phase 303, a fix action plan is generated and sent for implementation, executed, or a combination thereof, thereby fixing the problems identified in the reduced set of alerts. To this end, the system conducting the fixing may integrate with native tools in the infrastructure and may execute part or all of the fix action plan via such integration. As a non-limiting example, a fix action plan may be generated and an external workflow may be generated by that system. The external workflow may involve using the native tools, and may only be resolved once appropriate activities are performed by the native tools. To this end, the native tools may be preconfigured with a fix workflow process such as, but not limited to, a change management approval process, in order to implement, build, and deploy the fix.

At S360, a system conducting the fixing is integrated with native development lifecycle tools. Such native development lifecycle tools may include third party infrastructure management tools such as, but not limited to, code repositories, ticketing or notification systems, CI/CD managers, identity providers, code scanners, IaC tools, container repositories, automated security tools, cloud provider infrastructures, vulnerability management tools, combinations thereof, and the like. The integration allows for creating a fix action plan which utilizes the appropriate tools as deployed in the infrastructure with respect to different root causes based on the deployment of those tools relative to the components which are determined to be root causes of alerts within a computing infrastructure. In other words, the fix action plan may be based on the location of each root cause entity within the entity graph, and any mitigation actions performed according to the fix action plan may utilize native development lifecycle tools configured to remediate issues in locations of the software infrastructure where root cause entities are located.

At S370, a fix action plan is generated. The fix action plan is defined with respect to components within the software infrastructure (i.e., components represented in the entity graph), and may be determined so as to indicate corrective actions for components which are determined as root causes. An example method for generating a fix action plan is described further below with respect to FIG. 8.

At S380, the fix action plan is implemented. Implementing the fix action plan may include, but is not limited to, sending a notification indicating the fix action plan, for example to an operator for manual implementation. Alternatively, implementing the fix action plan may include automatically acting upon the affected components in accordance with the generated fix action plan. To this end, in such an implementation, S380 may further include generating computer-executable instructions for performing the steps of the fix action plan. Further, in at least some embodiments, implementing the fix action plan may include a combination of sending a notification and automatically executing the fix action plan.

FIG. 4 is a flowchart S310 illustrating a method for creating a knowledge base of semantic concepts and entity-identifying values according to an embodiment.

At S410, correlations between software components are derived by analyzing software development lifecycle (SDLC) pipeline data (e.g., data of a continuous integration [CI] and continuous delivery [CD] pipeline). Such SDLC data may include, but is not limited to, a pipeline configuration, build scripts, source code, combinations thereof, portions thereof, and the like. The correlations are identified based on references between software components indicated in such data, static analysis of software components, semantic analysis of text related to the software components, combinations thereof, and the like.

At S420, source control is linked to binaries of one or more applications based on the derived correlations. In an embodiment, S420 includes extracting uniquely identifying features of the source control artifacts and binaries from the analyzed data. In a further embodiment, the linking is limited to pairs of binaries and source control artifacts selected from limited sets of binaries and source control artifacts, respectively.

At S430, log data (e.g., log files) is analyzed for correlations. To this end, S430 may include identifying actions taken by software components and events which may be caused by those actions. These relationships may be identified based on circumstances such as, but not limited to, events occurring shortly after those actions, determinations that events could logically have been caused by the actions, combinations thereof, and the like. The identification of S430 may be based on probabilistic analysis such that, for example, correlations having likelihoods above a threshold are identified.
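As a non-limiting illustration of the thresholded correlation described above, the following Python sketch scores each action and event pair by how soon the event follows the action and keeps only pairs scoring above a threshold. The `LogRecord` type, the linear scoring function, the 60-second window, and the 0.5 threshold are all assumptions made for illustration; the disclosure does not prescribe a particular likelihood model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    name: str
    timestamp: float  # seconds since an arbitrary epoch


def likelihood(action, event, window=60.0):
    """Toy score in [0, 1]: higher when the event closely follows the action."""
    delay = event.timestamp - action.timestamp
    if delay < 0 or delay > window:
        return 0.0
    return 1.0 - delay / window


def correlate(actions, events, threshold=0.5):
    """Return (action, event, score) triples whose score exceeds the threshold."""
    return [
        (a.name, e.name, likelihood(a, e))
        for a in actions
        for e in events
        if likelihood(a, e) > threshold
    ]


# Hypothetical log records: one action and two later events.
actions = [LogRecord("deploy web-1", 100.0)]
events = [LogRecord("port 22 opened", 115.0), LogRecord("disk full", 400.0)]
links = correlate(actions, events)
```

In this sketch only the event occurring 15 seconds after the action is retained; the event occurring 300 seconds later falls outside the window and is discarded.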

As a non-limiting example, by analyzing log files from an integration or deployment server, links between code commits and binary hashes (and, consequently, the corresponding entities involved) may be identified. As another non-limiting example, by analyzing files in a cloud environment, information identifying entities used by automation engines may be identified.

In this regard, it has been identified that correlations indicated between log files can demonstrate that particular deployments occurred previously, which in turn aids in providing visibility to the DevOps pipeline in situations where static analysis might not satisfy the constraints, and may further aid in finding hidden automation. This, in turn, provides additional information about relationships between software components and event logic components which can be utilized in some non-limiting examples to more accurately identify root causes as discussed below with respect to FIG. 8.

It should be noted that S420 and S430 are both depicted as part of the flowchart 400, but that various embodiments may include either S420 or S430 without departing from the scope of the disclosure. As a non-limiting example, source control may be linked to binaries at S420 for implementations involving software containers, but such an implementation may not involve analyzing log files for correlations. Which steps are utilized for a given implementation may depend, for example, on the types of components deployed in the infrastructure or otherwise based on the types of components being evaluated.

At S440, resource definitions are normalized. In an embodiment, S440 includes identifying properties in an original definition of each software infrastructure resource, semantically analyzing the identified properties, and mapping the properties based on the results of the semantic analysis. The outcome is a set of properties in a universal definition format. An example method for transforming cloud resource definitions which may be utilized to normalize such definitions is described further below with respect to FIG. 5.

At S450, an entity graph is created based on the correlations identified at any or all of S410 through S430 using the normalized resource definitions. The entity graph includes nodes and edges. The nodes represent distinct logical entities such as, but not limited to, software components, event logic components, and the like. The edges connect entities based on the correlations identified at S410 through S430. The edges therefore represent relationships between pairs of entities, which in turn form paths as one navigates from a first entity to a second, from the second to a third, and so on. The paths following edges between nodes may therefore be utilized to identify connections between different entities (e.g., between event logic components and software components), thereby allowing for automatically and objectively identifying root causes of cybersecurity events.
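As a non-limiting illustration of navigating such paths, the following Python sketch stores an entity graph as an adjacency mapping and uses breadth-first search to find a shortest path from an event logic component to a software component. The node names and the `find_path` helper are hypothetical, and breadth-first search is only one of many traversal strategies that could be used.

```python
from collections import deque

# Hypothetical entity graph: each key is an entity (node); each value lists
# the entities it is connected to by an edge (a correlation).
ENTITY_GRAPH = {
    "policy:open-port": ["vm:web-1"],
    "vm:web-1": ["policy:open-port", "iac:web.tf"],
    "iac:web.tf": ["vm:web-1", "repo:infra"],
    "repo:infra": ["iac:web.tf"],
}


def find_path(graph, start, goal):
    """Return a shortest list of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Path from an event logic component (a policy) to a software component (a repository).
path = find_path(ENTITY_GRAPH, "policy:open-port", "repo:infra")
```

In this sketch, the returned path passes through the virtual machine and its IaC file, which are the kinds of intermediate entities that may be examined as candidate root causes.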

In some embodiments, S450 further includes incorporating translated entity-defining datasets into the entity graph. To this end, in such embodiments, S450 includes embedding translated data into the entity graph, and S450 may further include performing such translation. The entity-defining datasets provide explicit definitions of features of potential entities to be included in the entity graph. As a non-limiting example, such a dataset may be a schema of a DevOps tool (e.g., Terraform) that defines the function performed by each portion of the tool. Further incorporating such explicitly defined features allows for further increasing the granularity of the graph, thereby further improving applications of said graph in identifying connections between cybersecurity event data and event logic components.

At S460, a semantic concepts dictionary is created. The semantic concepts dictionary may be populated with predetermined semantic concepts. The semantic concepts indicate potential characteristics of entities in the entity graph such as, but not limited to, type (e.g., “Docker container”), potential identifiers (e.g., an Internet Protocol address), build automation, configuration, portions thereof, combinations thereof, and the like. Such semantic concepts provide additional information regarding entities which may be used to improve the accuracy of root cause identification by providing additional identifying data for entities that can be queried. These semantic concepts indicating potential characteristics of entities may be included as nodes in the entity graph, or may be included in data of nodes of the entity graph.

At S470, a knowledge base is built. The knowledge base includes the entity graph and the semantic concepts dictionary.

Once built, the knowledge base can be queried as described herein (for example, as discussed with respect to FIGS. 3 and 8) in order to determine connections between software components and cybersecurity events or potential cybersecurity events, thereby providing context related to cybersecurity events and allowing for automatically suggesting remedial actions to address cybersecurity events based on such contexts.

It should be noted that the steps of FIG. 4 are depicted in a particular order, but that the steps are not necessarily limited to the order depicted. As a non-limiting example, the semantic concepts dictionary may be created before or in parallel with any of the steps S410 through S450 without departing from the scope of the disclosure.

FIG. 5 is a flowchart S440 illustrating a method for transforming definitions of cloud resources according to an embodiment.

At S510, original definitions of computing infrastructure resources are analyzed. In an embodiment, S510 includes applying a policy engine to each original definition. Each original definition is expressed in an original IaC language.

In an embodiment, each original definition may be expressed in a declarative IaC language or in an imperative IaC language. A declarative IaC language is used to define resources in terms of the resulting resources to be deployed, while an imperative IaC language is used to define resources in terms of the commands to be utilized to deploy the resources which will result in the desired resources. The analysis performed at S510 may depend on the type of language (i.e., declarative or imperative) in which each original definition is expressed. More specifically, when the original definition for a resource is expressed in an imperative IaC language, S510 may further include performing control flow analysis on the imperative IaC language original definition in order to determine which resources are to be deployed.

In another embodiment, when the original definition is expressed in an imperative IaC language, S510 may further include instrumenting an IaC processor or IaC script corresponding to the original definition and analyzing the results of the execution. Instrumenting the IaC processor or IaC script (also referred to as instrumenting IaC instructions of the IaC processor or IaC script) includes, but is not limited to, executing instructions of the IaC processor or IaC script and monitoring the execution for potential errors or abnormalities. The instrumented instructions may be executed in a controlled environment, e.g., an environment where interactions with external systems and code (i.e., external to the controlled environment) are impossible, limited, or otherwise restricted to minimize harm that may come from executing untrusted code. In a further embodiment, instrumenting the IaC processor or script may also include replacing a backend of such processor or script with mock code and analyzing the output.

At S520, properties in the original definitions are identified. The properties include, but are not limited to, cybersecurity-related properties, at least some of which may be used to collectively uniquely identify a type of computing infrastructure resource. To this end, the properties may include, but are not limited to, configuration properties, encryption properties, backup properties, properties indicating connections to external systems, combinations thereof, and the like. The properties may further include other data describing or identifying computing resources such as, but not limited to, tags.

The properties may further include properties which are not necessarily used to determine a type of computing infrastructure resource but, instead, relate to a configuration which may differ even between resources of the same type. As a non-limiting example, such a property for a virtual machine (VM) may be a property indicating whether the VM has Internet access, which some VMs may have and others may not.

The properties may be identified using a policy engine configured according to an open policy engine language. As a non-limiting example, a policy engine defined via the open source language Rego may be utilized to identify the properties. Specifically, the properties may be identified as configuration properties, effects on a computing environment when the resource is deployed in the computing environment (e.g., resources created by the computing infrastructure resource, modifications to the computing environment made by the resource, etc.), both, and the like. These configurations and effects can be encoded into a single, unified format, in order to create the universal definitions of computing infrastructure resources.

Further, as noted above, for original definitions expressed in an imperative IaC language that are analyzed using control flow analysis, identifying the properties of the corresponding computing infrastructure resource may further include analyzing the commands indicated in the original definition in order to identify properties of the computing infrastructure resources deployed as a result of those commands.

Alternatively, identifying the properties for an imperative IaC language original definition may include instrumenting instructions of a corresponding IaC processor or script for the original definition and analyzing the results of such execution (e.g., the outputs and instrumentation logs resulting from instrumenting the instructions).

At S530, the identified properties are semantically analyzed. In an embodiment, S530 includes comparing the identified properties to predefined semantic concepts of known potential properties for computing infrastructure resources. More specifically, the known potential properties may be properties indicated or otherwise represented in universal definition templates. The semantic concepts may be semantic concepts included in a semantic concepts dictionary and stored in, for example, a state database such as the state database 145, FIG. 1.

At S540, the properties are mapped to universal definition templates based on the semantic analysis. That is, each property identified in one of the original definitions is mapped to a respective property represented in one or more of the universal definition templates. Accordingly, the mapping can be used to determine a matching universal definition template for a computing infrastructure resource based on a combination of the identified properties mapped to the properties of that universal definition template. In an embodiment, when the semantic analysis yields an identification of a type of the computing resource, a universal definition template associated with the identified type of the computing resource may be used to map the properties to the respective universal definition template.

At S550, a universal definition is output for each of the computing infrastructure resources based on the mapping. In an embodiment, S550 includes inserting one or more properties represented in the original definition of each computing infrastructure resource into respective fields of the matching universal definition template determined for the computing infrastructure resource at S540. The result of S550 is a universal definition for each of the computing infrastructure resources expressed in a unified format, which allows for application of policies created using the unified format.

As a non-limiting example, for a Cryptographic Key computing infrastructure resource, an example universal definition may include the following properties represented in a unified format:

1. Allowed actions = encrypt, sign
2. Key rotation configuration = rotated every 24 hours
3. Symmetric? = true
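As a non-limiting illustration of the mapping at S540 and the output at S550, the following Python sketch inserts properties from two differently named original definitions into a single template for the Cryptographic Key example. The language labels `lang_a` and `lang_b`, the original property names, and the template field names are all hypothetical; in practice the mapping would result from the semantic analysis at S530 rather than from a hard-coded table.

```python
# Hypothetical universal template: field names for a Cryptographic Key resource.
UNIVERSAL_KEY_TEMPLATE = ("allowed_actions", "key_rotation_hours", "symmetric")

# Hypothetical mappings from original property names to template fields,
# one mapping per original IaC language.
PROPERTY_MAPPINGS = {
    "lang_a": {
        "usages": "allowed_actions",
        "rotate_every_h": "key_rotation_hours",
        "is_symmetric": "symmetric",
    },
    "lang_b": {
        "permittedOps": "allowed_actions",
        "rotationHours": "key_rotation_hours",
        "symmetricKey": "symmetric",
    },
}


def to_universal(language, original):
    """Insert mapped properties into the template; unmapped fields stay None."""
    mapping = PROPERTY_MAPPINGS[language]
    universal = dict.fromkeys(UNIVERSAL_KEY_TEMPLATE)
    for name, value in original.items():
        field = mapping.get(name)
        if field is not None:  # properties with no template field are ignored
            universal[field] = value
    return universal


key_a = to_universal(
    "lang_a",
    {"usages": ["encrypt", "sign"], "rotate_every_h": 24, "is_symmetric": True},
)
key_b = to_universal(
    "lang_b",
    {"permittedOps": ["encrypt", "sign"], "rotationHours": 24,
     "symmetricKey": True, "label": "x"},
)
```

Both original definitions yield the same universal definition, which is what allows a single policy written against the unified format to be applied to either one.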

FIG. 6 is a flowchart S320 illustrating a method for updating a mapping of a software development pipeline according to an embodiment.

At S610, software development pipeline data is accessed or otherwise obtained. The software development pipeline data may be, for example, software development lifecycle (SDLC) pipeline data (e.g., data of a continuous integration [CI] and continuous delivery [CD] pipeline). Such SDLC data may include, but is not limited to, a pipeline configuration, a pipeline definition, build scripts and other scripts used in the pipeline (e.g., deployment scripts, validation scripts, testing scripts, etc.), source code, logs, manifests, metadata, combinations thereof, portions thereof, and the like. In some embodiments, the software development pipeline data may be accessed using computing interface permissions provided by an operator of the software development pipeline. The accessed software development pipeline data may be, but is not limited to, data stored in a source control, data retrieved via an API, data uploaded by a user for analysis, combinations thereof, and the like.

At S620, steps of pipeline execution for one or more software development pipelines are enumerated. Each step is a procedure including a set of instructions (e.g., machine-readable computer instructions) for performing one or more respective tasks. In this regard, it is noted that a given software development pipeline includes one or more software components in a computing environment which may be accessed via procedures. Thus, the steps are enumerated such that the procedures used to access different components of the software development infrastructure within the pipeline can be identified and analyzed.

In an embodiment, S620 includes analyzing the logs, manifests, and metadata of the software development pipeline data. In a further embodiment, S620 may include performing a recursive enumeration that starts with a top-level identifier for a service (e.g., an organization identifier of an organization that owns or operates the service to be built using the software development pipeline). The recursive enumeration includes identifying, using data accessed via computing interfaces, components within the service in layers, with data related to components in one layer being used to enumerate components in the next layers. In other words, portions of the software development infrastructure are iteratively enumerated in multiple iterations by enumerating components within each layer of the software development infrastructure at each iteration. During this recursive enumeration, pipelines may be identified and then steps within the pipeline may be enumerated.

In this regard, it is noted that a software development infrastructure typically includes various logical components that encapsulate different aspects of the software development infrastructure with varying granularities. In other words, some aspects include others in a layered manner. As a non-limiting example, a top-level software development service (top layer/layer 1) to be built may include projects and repositories (layer 2), where each project includes one or more pipelines (layer 3), each pipeline includes jobs (layer 4), and each job utilizes one or more steps (layer 5). The sub-components of each logical component are reflected in the logs, manifests, and metadata of the software development infrastructure (i.e., the software development pipeline data accessed at S610) such that these sub-components can be identified, thereby enumerating components in each layer and ultimately enumerating steps in one of the layers. Further, relationships between and among these components and sub-components can be unearthed through this recursive enumeration.

To this end, in a further embodiment, S620 includes recursively enumerating all of the projects and repositories under the top-level identifier of a software development service using computing interfaces of the pipeline (e.g., using the provided computing interface permissions). For each project enumerated this way, the computing interfaces are used to enumerate all of the pipelines of the project, then the jobs of each pipeline, and finally the steps taken in each job's run. The result is a complete enumeration of all steps used for pipeline execution of software development pipelines within the software development infrastructure.
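As a non-limiting illustration of the layered recursive enumeration, the following Python sketch walks from a top-level identifier down through projects, pipelines, and jobs to collect steps. The `CHILDREN` table is a hypothetical in-memory stand-in for data that would be retrieved via the pipeline's computing interfaces, and the component names are invented. As a simplification, the sketch treats any component with no children as a step.

```python
# Hypothetical stand-in for data returned by a pipeline's computing
# interfaces: each component maps to the components in the next layer down.
CHILDREN = {
    "org:acme": ["project:shop"],
    "project:shop": ["pipeline:build", "pipeline:deploy"],
    "pipeline:build": ["job:compile"],
    "pipeline:deploy": ["job:release"],
    "job:compile": ["step:checkout", "step:make"],
    "job:release": ["step:push-image"],
}


def enumerate_steps(component, list_children):
    """Recursively walk the layers below component and collect the leaves.

    Simplification: a component without children is treated as a step.
    """
    children = list_children(component)
    if not children:
        return [component]
    steps = []
    for child in children:
        steps.extend(enumerate_steps(child, list_children))
    return steps


# list_children would normally call a computing interface; here it reads CHILDREN.
steps = enumerate_steps("org:acme", lambda c: CHILDREN.get(c, []))
```

Passing `list_children` as a parameter keeps the traversal independent of how each layer is actually queried.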

In another embodiment, one or more of the steps may be identified imperatively by analyzing different types of objects in the software development infrastructure. This imperative analysis may be performed when the types of objects differ between layers, i.e., when different layers include different types of objects such that layers can be distinguished based on the types of objects included therein. Thus, in such an embodiment, objects in the software development infrastructure may be enumerated without recursively enumerating layers, and relationships between and among components can be determined with respect to layers based on the types of components.

Alternatively or in addition, steps may be identified based on triggers between pipelines. More specifically, connections between components of different pipelines may be identified based on execution of a first pipeline triggering a second pipeline's execution. When there is a software dependency between a component built by the first pipeline and a component built by the second pipeline, execution of the first pipeline results in execution of the second pipeline when the component of the first pipeline calls the component of the second pipeline. In such a case, recursive analysis of the first pipeline may proceed into analyzing the second pipeline, thereby completing the analysis of the entire process starting with the first pipeline and resulting in execution of the second pipeline.

At S630, the enumerated steps are mapped with respect to components of a software development infrastructure in order to create a mapping that includes the relative locations of steps within the software development pipeline. In various embodiments, the steps are mapped at least with respect to each other within the pipeline.

The relative location of a given step with respect to other components of the software development infrastructure is defined at least with respect to connections between and among components of the software development infrastructure, and may further be defined with respect to order of processing related to those connections.

The connections may include passing arguments, passing outputs, and the like, from one component to another (e.g., from one step to another), or otherwise based on the use of the results of one component by another component. As a non-limiting example, a connection may be defined as artifacts built by one step being scanned by another step or arguments used by one step being passed to another step.

The order may be based on the flow of data between the connected steps, e.g., data output or processed by a first step in a given order may be subsequently passed to or processed by a second step that is identified as being later in the order. As a non-limiting example, code created at one step may be analyzed by another step. As another non-limiting example, code scanned at one step may be deployed in another step.
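As a non-limiting illustration of deriving an order from data flow, the following Python sketch records, for each step, the steps whose outputs it consumes and then computes an order using the standard library's `graphlib.TopologicalSorter` (available in Python 3.9 and later). The step names are hypothetical, and a topological sort is only one possible way to realize the ordering described above.

```python
from graphlib import TopologicalSorter

# Hypothetical connections: each step maps to the set of steps whose
# outputs it consumes (its predecessors in the flow of data).
CONSUMES_FROM = {
    "scan": {"build"},   # artifacts built by "build" are scanned by "scan"
    "deploy": {"scan"},  # code scanned by "scan" is deployed by "deploy"
}

# static_order() yields each step only after all of its predecessors.
order = list(TopologicalSorter(CONSUMES_FROM).static_order())
```

For this linear chain the resulting order is build, then scan, then deploy; when steps are independent of one another, several valid orders exist and the sorter returns one of them.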

In at least some embodiments, the steps are mapped with respect to an entity graph indicating entities and connections between entities in the software development or SDLC pipeline. In a further embodiment, the entity graph may be part of a knowledge base constructed, for example, as described above with respect to FIG. 4. Creating entity graphs and knowledge bases for software development pipelines is described further in U.S. patent application Ser. No. 17/507,180, assigned to the common assignee, the contents of which are hereby incorporated by reference. The knowledge graph may further include step data associated with each mapped step which may be indicative of various properties of the mapped steps, and can therefore be queried for this step data.

At S640, the enumerated steps are classified. The classification is based on step properties of each step such as, but not limited to, provider, type, name, arguments, combinations thereof, and the like. In some embodiments, S640 further includes normalizing the step data which may indicate such step properties. Further, S640 may also include parsing and interpreting text of arguments in order to semantically analyze the arguments used by those steps, thereby improving the classification as compared to solutions which categorize steps based solely on name and/or task descriptions.
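As a non-limiting illustration of classifying a step using its arguments as well as its name, the following Python sketch tokenizes both and looks for keywords associated with each class. The class labels, the keyword sets, and the step dictionary layout are hypothetical, and keyword lookup is a deliberately simple stand-in for the semantic analysis of arguments described above.

```python
import shlex

# Hypothetical classification rules: class label -> keywords that suggest it.
CLASS_KEYWORDS = {
    "build": {"build", "compile", "make"},
    "scan": {"scan", "lint", "audit"},
    "deploy": {"deploy", "apply", "push"},
}


def classify_step(step):
    """Classify a step from the tokens of its name and its argument text."""
    tokens = set(step["name"].lower().split())
    # shlex.split parses the argument text the way a POSIX shell would.
    tokens.update(t.lower() for t in shlex.split(step.get("arguments", "")))
    for label, keywords in CLASS_KEYWORDS.items():
        if tokens & keywords:
            return label
    return "unknown"


# The name alone is uninformative; the arguments reveal a deployment.
generic = classify_step({"name": "Run script", "arguments": "terraform apply -auto-approve"})
named = classify_step({"name": "Compile sources"})
```

The first step would be left unclassified by a name-only approach, which is the gap that argument parsing is intended to close.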

At S650, a mapping is updated based on the classifications of the steps. The mapping may be, for example, a mapping of entities in an entity graph. The mapping is updated so as to include the classifications of the steps, for example, as properties of the mapped steps.

FIG. 7 is an example flowchart S340 illustrating a method for alert management according to an embodiment.

At S710, alerts are obtained from detection tools. In accordance with various disclosed embodiments, the alerts are received from detection tools which monitor for events or other findings with respect to different parts of the software development pipeline (e.g., coding, building, deployment, staging, production). The alerts from different detection tools and/or related to different parts of the software development pipeline may be formatted differently and may indicate different software components which may be involved for any given cyber threat.

At S720, the obtained alerts are normalized. In an embodiment, the alerts are normalized into a unified notation. As noted above, alerts from different sources (e.g., different detection tools) may be formatted differently, even if those alerts contain similar information. Normalizing the alerts into a unified format allows for effectively comparing between differently formatted alerts. The alerts may be normalized, for example, with respect to universal definition templates as discussed above with respect to FIG. 5.
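As a non-limiting illustration of normalizing alerts into a unified notation, the following Python sketch converts alerts from two differently formatted sources into dictionaries with the same keys. The source labels `tool_a` and `tool_b`, their field names, and the unified keys are hypothetical and do not reflect the output format of any actual detection tool.

```python
def normalize_alert(source, raw):
    """Map a tool-specific alert dictionary onto one unified notation."""
    if source == "tool_a":
        # Hypothetical flat format that reports a container image.
        return {"source": source, "cve": raw["cve_id"].upper(), "component": raw["image"]}
    if source == "tool_b":
        # Hypothetical nested format that reports a source file.
        return {
            "source": source,
            "cve": raw["vulnerability"]["id"].upper(),
            "component": raw["file"],
        }
    raise ValueError(f"unknown alert source: {source}")


alert_a = normalize_alert("tool_a", {"cve_id": "cve-2022-24434", "image": "image:shop"})
alert_b = normalize_alert(
    "tool_b", {"vulnerability": {"id": "CVE-2022-24434"}, "file": "Dockerfile:shop"}
)
```

After normalization the two alerts can be compared field by field, for example on the `cve` key.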

At S730, the alerts are analyzed in order to match the alerts with respect to issues indicated therein. In an embodiment, S730 includes analyzing the text of the alerts to identify related issue-indicating text or otherwise analyzing the alerts for predefined similar issues. Alternatively or in combination, any or all of the alerts may include data in a machine-readable format, and S730 may include analyzing certain fields or attributes of such machine-readable format data in order to identify similar fields or attributes.

In an embodiment, S730 includes analyzing common traits or indicators such as, but not limited to, common vulnerabilities and exposures (CVEs) indicated in alerts, in order to determine which common traits or indicators are included in each alert and comparing those common traits or indicators to determine which alerts relate to the same kind of issue. The common traits or indicators may be predetermined traits or indicators used by different detection tools such that they are represented in the same manner in alerts from those different detection tools.

In a further embodiment, one or more matching rules may be applied that define requirements for matching alerts based on such common traits or indicators. Such rules may require, for example, matching at least one common trait or indicator, matching a threshold number of common traits and/or indicators, matching particular sets of common traits and/or indicators, combinations thereof, and the like. In this regard, it is noted that CVEs included in alerts include standardized identifiers for particular vulnerabilities and exposures. These standardized identifiers will therefore demonstrate the type of issues involved in the alert in a format that is directly comparable to that of other alerts.

In yet a further embodiment, S730 may further include analyzing meta information about the common trait or indicator (e.g., a CVE) and relevant packages such as, but not limited to, version. This allows for further improving the granularity of the comparison and, therefore, the accuracy of the matching.

At S740, correlations among software components related to the alerts are identified. The correlations may include, but are not limited to, correlations between portions of source code and discrete software components (e.g., correlations between build files and particular software containers). As a non-limiting example, a correlation may be identified between a build file containing instructions for creating a given container image and the software container corresponding to that container image.

In an embodiment, S740 includes querying a data structure (e.g., a database) storing an inventory of associations between software components among different components of the software development pipeline. In accordance with various disclosed embodiments, such a data structure includes the entity graph as described herein. In a further embodiment, the correlations database is created using an attribution process, where the correlations in the database are based on the attributions. In yet a further embodiment, at least a portion of the attribution process is performed as described in U.S. patent application Ser. No. 17/656,914, assigned to the common assignee, the contents of which are hereby incorporated by reference.

More specifically, in an embodiment, the inventory at least includes associations between build files and configuration files, with each configuration file corresponding to a respective software container. Accordingly, querying a data structure including such an inventory using a given file allows for identifying correlations across different portions of the software development pipeline. By identifying correlations between components indicated by alerts in different portions of the software development pipeline as well as matching the alerts themselves, alerts which relate to the same underlying issue or threat may be identified as duplicates with a high degree of accuracy, thereby allowing for accurate deduplication and prioritization of alerts.

In an embodiment, S740 includes marking the alerts with the identified correlations. As a non-limiting example, when a correlation between a software container “SC1” and a build file “BF1” is identified with respect to SC1 and BF1 being indicated in alert messages from different detection tools, an alert indicating SC1 may be marked as also relating to BF1 and vice versa, i.e., an alert indicating BF1 may be marked as also related to SC1.

At S750, alerts from different detection tools are matched in order to identify one or more sets (i.e., groups) of duplicate alerts. Each set of matching alerts demonstrates relationships across different portions of the software development infrastructure realized as a combination of at least source verification and correlations. In other words, in an embodiment, two alerts are determined to be duplicates of each other when they both indicate correlated software components and relate to the same type of issue.

In this regard, it has been identified that matching alerts based on common traits or indicators such as CVEs alone does not allow for accurately identifying duplicate alerts since the same trait in two different alerts indicates that those alerts might relate to the same kind of issue, but not necessarily to the same specific issue or root cause. By both identifying the same kind of issue (e.g., based on CVEs) and identifying related software components (e.g., based on correlations between data such as build and configuration files for the same software container) indicated in two alerts, those alerts can be identified as duplicates with a high degree of accuracy. In other words, two alerts that relate to the same issue (i.e., including the same CVEs) in which software components indicated in one alert are linked to software components in the other alert can be said to be alerts for the same underlying issue (and therefore duplicates) with a high degree of accuracy.

At S760, the alerts are deduplicated based on the matching. Deduplicating the alerts may include, but is not limited to, grouping together matching alerts or removing redundant instances of matching alerts such that only one instance of each unique alert remains across alerts generated by different tools.
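As a non-limiting illustration of the matching at S750 and the deduplication at S760, the following Python sketch treats two alerts as duplicates only when they share at least one CVE and indicate correlated software components, then groups duplicates together. The alert dictionary layout, the component names, and the `correlations` set are hypothetical, and the greedy grouping shown does not merge groups that later turn out to be linked.

```python
def are_duplicates(alert_a, alert_b, correlations):
    """Duplicates share at least one CVE and indicate correlated components."""
    same_issue = bool(set(alert_a["cves"]) & set(alert_b["cves"]))
    pair = frozenset((alert_a["component"], alert_b["component"]))
    linked = alert_a["component"] == alert_b["component"] or pair in correlations
    return same_issue and linked


def deduplicate(alerts, correlations):
    """Greedily group alerts so that each group holds one underlying issue."""
    groups = []
    for alert in alerts:
        for group in groups:
            if any(are_duplicates(alert, member, correlations) for member in group):
                group.append(alert)
                break
        else:  # no existing group matched, so start a new one
            groups.append([alert])
    return groups


# Hypothetical correlation between a container image and its build file.
correlations = {frozenset(("image:shop", "Dockerfile:shop"))}
alerts = [
    {"tool": "tool_a", "cves": ["CVE-2022-24434"], "component": "image:shop"},
    {"tool": "tool_b", "cves": ["CVE-2022-24434"], "component": "Dockerfile:shop"},
    {"tool": "tool_b", "cves": ["CVE-2022-24434"], "component": "Dockerfile:other"},
]
groups = deduplicate(alerts, correlations)
```

The third alert carries the same CVE but indicates an uncorrelated component, so it remains in its own group, reflecting that a shared CVE alone does not establish a duplicate.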

At S770, the alerts are prioritized. The prioritization may be performed using one or more prioritization rules, and more specifically prioritization rules that define how to prioritize alerts with respect to a mapping of the software infrastructure (e.g., the mapping in the entity graph described herein). The prioritization rules may define conditions including, but not limited to, specific components, types of components, relative locations within a software infrastructure, connections (e.g., connections to specific other components), combinations thereof, and the like, and may rank the aforementioned items. The ranking may be used to prioritize alerts. For example, a component that meets one or more conditions which are ranked higher than the conditions met by another component may be prioritized over that other component. Further, a weighted scoring scheme may be utilized for instances where a component might meet more than one condition, and the prioritization rules may further define the weighted scoring scheme including any applicable weights.
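As a non-limiting illustration of a weighted scoring scheme, the following Python sketch sums the weights of every condition a component meets and sorts alerts by the resulting score. The conditions, the weights, and the component attributes are hypothetical values chosen for illustration only.

```python
# Hypothetical prioritization rules: (condition name, predicate, weight).
RULES = [
    ("internet_facing", lambda c: c.get("internet_facing", False), 5.0),
    ("production", lambda c: c.get("stage") == "production", 3.0),
    ("is_container", lambda c: c.get("type") == "container", 1.0),
]


def score(component):
    """Sum the weights of all conditions that the component meets."""
    return sum(weight for _, predicate, weight in RULES if predicate(component))


def prioritize(alerts):
    """Order alerts from highest to lowest component score."""
    return sorted(alerts, key=lambda a: score(a["component"]), reverse=True)


alerts = [
    {"id": "A1", "component": {"type": "container", "stage": "staging"}},
    {"id": "A2", "component": {"type": "vm", "stage": "production", "internet_facing": True}},
    {"id": "A3", "component": {"type": "container", "stage": "production"}},
]
ranked = [a["id"] for a in prioritize(alerts)]
```

Here the Internet-facing production component scores 8.0, the production container scores 4.0, and the staging container scores 1.0, so the alerts are ordered A2, A3, A1.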

As a non-limiting example, alerts are obtained from at least two tools: a Lacework™ detection tool and a Snyk™ detection tool. In this example, the Lacework™ tool generates an alert for a container image hosted in a customer's software container registry. Accordingly, the Lacework™ tool generates alerts related to a build artifact of the software development pipeline. The Snyk™ tool generates an alert based on source code of the customer, i.e., the Snyk™ tool generates alerts related to a coding phase of the software development pipeline. In this example, both the Lacework™ tool and the Snyk™ tool generate an alert indicating the CVE with the identifier “CVE-2022-24434,” which indicates that a software component being developed may be vulnerable to Denial of Service (DoS) attacks.

In this example, based on the common CVE, the alerts are further analyzed for components indicated therein, and the container image in the Lacework™ tool alert is identified. A database is queried with a configuration file of the container image, and the database returns a connection between the configuration file (and, consequently, the container image itself) and a Docker file (a type of build file) that is indicated in the Snyk™ tool alert. The alerts generated by the Lacework™ tool and the Snyk™ tool may each be marked with the correlation. Accordingly, the Lacework™ tool and the Snyk™ tool alerts are identified as duplicates of each other and managed accordingly. In particular, either the alerts are combined into a single alert summary or otherwise one of the alerts is removed, thereby reducing the total number of alerts to be addressed.

FIG. 8 is a flowchart S370 illustrating a method for remediating cybersecurity events based on entity-identifying values and semantic concepts according to an embodiment. In an embodiment, the method is performed by the pipeline manager 100, FIG. 1.

At S810, cybersecurity event data is identified within alerts. Identifying the cybersecurity event data may include, but is not limited to, applying event identification rules which define procedures for parsing alerts in order to identify events contained therein. Such event identification rules may further include definitions of known event formatting, known organization of events within alerts, keywords known to be indicative of events, combinations thereof, and the like.

In some embodiments, the cybersecurity event may be a simulated cybersecurity event or otherwise the cybersecurity event data may be simulated cybersecurity event data such that the method may begin even if an actual cybersecurity event has not yet occurred (e.g., before an alert has triggered or otherwise before the cybersecurity event is indicated in cybersecurity event data). Such simulated data may be provided via user inputs, may be randomly generated, and the like. Using simulated cybersecurity events allows for proactively testing the software infrastructure, which in turn can be utilized to remediate problems before the software infrastructure actually experiences those problems.

At S820, the cybersecurity event data is semantically analyzed. In an embodiment, S820 includes extracting semantic keywords from textual content included in the cybersecurity event data. Such textual content may include, but is not limited to, text of an alert or log, text of a policy or other event logic component linked to a cybersecurity event (e.g., code defining detection logic used to detect the cybersecurity event, a query which resulted in the alert triggering, etc.), a machine readable representation of an alert (e.g., a JSON or XML representation of the alert), combinations thereof, and the like. To this end, in a further embodiment, S820 may further include performing natural language processing on such text in order to identify known semantic concepts (e.g., semantic concepts defined in a semantic concepts dictionary) and to extract the identified semantic concepts. Alternatively or collectively, S820 may further include mapping from tokens of a machine readable representation to semantic concepts, where the mapping may be explicitly defined or learned using machine learning.
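
As a simplified, non-limiting illustration, dictionary-based extraction of semantic concepts might be sketched as follows. The sketch substitutes plain keyword matching for natural language processing, and the concept names and keywords in the dictionary are hypothetical.

```python
import re

# Hypothetical semantic concepts dictionary: concept -> indicative keywords.
CONCEPTS = {
    "public_access": ("public", "internet-facing"),
    "encryption": ("encrypt", "unencrypted", "kms"),
    "storage_bucket": ("bucket", "s3"),
}


def extract_concepts(text):
    """Return the set of known concepts whose keywords start a word in the text."""
    lowered = text.lower()
    found = set()
    for concept, keywords in CONCEPTS.items():
        # \b anchors each keyword to the start of a word; suffixes are allowed,
        # so "public" also matches "publicly".
        if any(re.search(r"\b" + re.escape(k), lowered) for k in keywords):
            found.add(concept)
    return found


concepts = extract_concepts("S3 bucket 'logs' is publicly accessible")
```

For the sample alert text, the sketch identifies the "public_access" and "storage_bucket" concepts and does not identify "encryption".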

At S830, entity-identifying values are extracted from the cybersecurity event data. In an embodiment, S830 includes applying one or more entity identification rules in order to identify the values to be extracted from the cybersecurity event data. Such rules may define, for example but not limited to, fields that typically contain entity-identifying values, common formats of entity-identifying values, other indicators of a value that represents a specific entity, and the like. The entity-identifying values may include, but are not limited to, values which identify a specific entity, values which indicate groups to which an entity belongs (e.g., a name of a resource group to which the entity belongs), both, and the like. Alternatively or collectively, a machine learning model trained to extract entity-identifying values may be applied to the cybersecurity event data.
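
A rule-based extraction of this kind might be sketched as follows. The field names and the simplified pattern for Amazon Resource Names (ARNs) are assumptions chosen for illustration; an actual deployment would define its own entity identification rules.

```python
import re

# Hypothetical rule: fields that typically contain entity-identifying values.
ID_FIELDS = ("resource_id", "image", "resource_group")

# Hypothetical rule: a simplified common format (an AWS ARN) to find in free text.
ARN_PATTERN = re.compile(r"arn:aws:[a-z0-9-]+:[a-z0-9-]*:\d*:[^\s\"']+")


def extract_entity_values(alert):
    """Return (kind, value) pairs found via known fields and known formats."""
    values = []
    for field in ID_FIELDS:
        value = alert.get(field)
        if isinstance(value, str) and value:
            values.append((field, value))
    for match in ARN_PATTERN.findall(alert.get("message", "")):
        values.append(("arn", match))
    return values


sample_alert = {
    "resource_group": "payments",
    "image": "registry.example.com/app:1.0",
    "message": "Policy violated by arn:aws:s3:::logs-bucket in this account",
}
entity_values = extract_entity_values(sample_alert)
```

The result contains both a value identifying a specific entity (the image and the ARN) and a value indicating a group to which an entity belongs (the resource group).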

At S840, a query is generated and applied based on the semantic analysis and the entity-identifying values. In an embodiment, the query includes both one or more semantic concepts as well as one or more entity-identifying values.

The query may be generated based on a predetermined query language. Such a query language may be designed for the purpose of harnessing logical deduction rules for querying entity graphs or relational databases in order to obtain relevant information for development, security, and operations for the various domains of a software infrastructure. Alternatively, the query may be generated in a general purpose query language. In some implementations, the query language may be custom-defined to allow for customization of queries for a specific environment (e.g., a cloud environment used by a specific company) in a manner that can scale up to different stacks.

In an embodiment, the query is applied using a fuzzy matching process based on a predetermined template. The fuzzy matching process yields results indicating an event logic component (e.g., a policy, code defining business logic, a query, a portion thereof, etc.) and a software component entity among the entity graph that most closely matches the event logic component and software component entities indicated in the text of the cybersecurity event data.
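
One possible way to approximate such fuzzy matching is sketched below using the sequence-similarity matcher in Python's standard difflib module. This is a stand-in technique chosen for illustration, not the template-based process of the disclosure, and the node names and cutoff value are hypothetical.

```python
import difflib


def closest_entity(name_in_alert, graph_node_names, cutoff=0.6):
    """Return the graph node name most similar to the name in the alert, or None."""
    # Compare case-insensitively but return the node name as stored in the graph.
    lowered = {name.lower(): name for name in graph_node_names}
    hits = difflib.get_close_matches(
        name_in_alert.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


policy_nodes = ["s3-public-access-policy", "iam-root-login-policy"]
matched_policy = closest_entity("S3 public access policy", policy_nodes)
```

The same helper could be applied once against event logic component nodes and once against software component nodes in order to obtain the closest match of each kind.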

It should be noted that steps S820 through S840 are described in some embodiments as being potentially performed when an alert has already been received, but that the disclosed embodiments are not limited to such an implementation. In particular, an alert may be semantically analyzed prior to the alert actually being triggered, for example by using the alert as simulated cybersecurity event data. In this regard, it is noted that some forms of cybersecurity event data such as alerts may use predetermined text that is included in notifications when the alert is generated. Accordingly, such predetermined text can be semantically analyzed before the alert is actually received, and the results of the prior semantic analysis may be used as described herein.

At S850, one or more paths between a discrete portion of event logic related to the cybersecurity event and an entity in a software infrastructure are identified within an entity graph (e.g., the entity graph created as described above with respect to S450) based on the results of the query. As noted above, the generated query includes both semantic concepts and entity-identifying values extracted from the cybersecurity event data, which indicates both entities involved in the event that resulted in the cybersecurity event data being generated or provided and the event logic related to the cybersecurity event (e.g., event logic of a policy which triggered an alert for the cybersecurity event, business logic which was used to generate log data indicating the cybersecurity event, queries about the cause of a cybersecurity event, etc.). Using these concepts and values to query the entity graph allows for identifying paths between specific entities of the software infrastructure and event logic related to the cybersecurity event.

In some implementations, multiple paths are identified between the event logic component and the software component, and one or more root cause paths are determined as the paths to use for subsequent processing. Each root cause path may be, for example but not limited to, a shortest path among paths (e.g., one of the paths having the fewest links connecting nodes from a node representing a policy indicated by an alert to a node representing the entity indicated in the cybersecurity event data).
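
Selecting a shortest path of this kind might be sketched with a breadth-first search as follows. The adjacency-dictionary representation of the entity graph and the node names are assumptions made for this sketch.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search returning a path with the fewest links, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical entity graph: a policy node linked to a bucket by two routes.
entity_graph = {
    "policy:no-public-buckets": ["resource-type:bucket", "iac:main.tf"],
    "resource-type:bucket": ["team:storage"],
    "team:storage": ["bucket:logs"],
    "iac:main.tf": ["bucket:logs"],
}
root_cause_path = shortest_path(
    entity_graph, "policy:no-public-buckets", "bucket:logs"
)
```

Because breadth-first search explores paths in order of increasing length, the first path that reaches the target node has the fewest links; here it passes through the "iac:main.tf" node rather than the longer route through the "team:storage" node.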

At S860, one or more root cause entities are identified based on the paths. The root cause entities may be entities associated with event logic related to the cause of a cybersecurity event indicated in the cybersecurity event data such as, but not limited to, each software component of the software infrastructure that is connected to a policy which triggered an alert via the identified at least one path. The root cause entities are collectively determined as the root cause of the cybersecurity event. As a non-limiting example, a root cause entity may be an entity containing faulty code (e.g., a file or container) which caused an alert to trigger. By identifying the entities which are the root cause of a cybersecurity event, more accurate and specific information about the cause of the cybersecurity event can be provided, and appropriate remedial actions involving those entities may be determined.

At S870, fix determination rules are applied with respect to the identified root causes. The fix determination rules may be, but are not limited to, predetermined rules defining known fixes for respective root causes. The fix determination rules may further be defined with respect to locations within the software infrastructure which may be indicated in entity graphs as described herein. That is, the fix determination rules may define fixes which are known to correct certain types of root causes generally, may specifically define fixes for certain types of root causes when they occur in a particular location relative to the rest of the computing infrastructure, or both. To this end, S870 may further include identifying locations of the root cause entities within the software infrastructure based on the mapping.
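
As a non-limiting illustration, fix determination rules that combine general fixes with location-specific fixes might be sketched as a lookup table. The root cause types, locations, and fix descriptions below are hypothetical.

```python
# Hypothetical fix determination rules keyed by (root cause type, location).
# A location of None denotes a general fix for that type of root cause.
FIX_RULES = {
    ("public_bucket", None): "set the bucket access to private",
    ("public_bucket", "iac"): "change the access setting in the IaC definition and re-apply",
    ("vulnerable_dependency", None): "upgrade the dependency to a patched version",
}


def determine_fix(root_cause_type, location=None):
    """Prefer a location-specific fix, falling back to the general fix."""
    specific = FIX_RULES.get((root_cause_type, location))
    if specific is not None:
        return specific
    return FIX_RULES.get((root_cause_type, None))


iac_fix = determine_fix("public_bucket", "iac")
runtime_fix = determine_fix("public_bucket", "runtime")
```

In this sketch, a root cause located in an IaC definition receives the location-specific fix, while the same type of root cause found elsewhere falls back to the general fix.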

At S880, a fix action plan is generated based on the results of applying the fix determination rules. The fix action plan includes one or more remedial actions, and may provide indications of entities as demonstrated in the entity graph or otherwise indicate components within the software infrastructure to which the fix action plan should be applied, owners of those components who should apply part or all of the fix action plan, or other information represented in the entity graph relevant to implementing the fix action plan. The fix action plan may be realized as human-readable data (e.g., text), as computer-readable instructions (i.e., code), and the like. When the fix action is to be implemented automatically, the fix action plan may further include instructions for performing the remedial actions.

The remedial actions may include, but are not limited to, generating and sending a notification, performing mitigation actions such as changing configurations of software components, changing code of software components, combinations thereof, and the like. As a non-limiting example, a configuration of a root cause entity that is a software component may be changed from “allow” to “deny” with respect to a particular capability of the software component, thereby mitigating the cause of the cybersecurity event.

When the remedial action includes generating a notification, S880 may further include determining to which person the notification should be sent. In implementations where the entity graph includes nodes representing code owners, the entity to which the notification should be sent may be a person, team, business unit, and the like, represented by a code owner node linked to the root cause entity node in the entity graph. As noted above, by using known links between software components and code owners, an appropriate person to investigate or fix an issue can be automatically and accurately identified.
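
Resolving the recipient of a notification from code owner nodes might be sketched as follows. The edge-list representation of the entity graph and the "owned_by" relation name are assumptions made for this sketch.

```python
def find_owners(edges, root_cause):
    """Return the owner nodes linked to the root cause entity node, sorted by name."""
    return sorted(
        {dst for src, relation, dst in edges if src == root_cause and relation == "owned_by"}
    )


# Hypothetical typed edges of an entity graph: (source, relation, destination).
graph_edges = [
    ("iac:main.tf", "defines", "bucket:logs"),
    ("iac:main.tf", "owned_by", "team:storage"),
    ("service:checkout", "owned_by", "team:payments"),
]
recipients = find_owners(graph_edges, "iac:main.tf")
```

A notification generated for the root cause entity "iac:main.tf" would, in this sketch, be addressed to the "team:storage" owner node.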

Additionally, when the remedial action includes generating a notification, the notification may further indicate a degree of risk of the underlying issue. Such a degree of risk may be determined based on, for example, the semantic analysis of the cybersecurity event data, text of the cybersecurity event data, a known risk level associated with event logic components related to the cybersecurity event indicated in the cybersecurity event data, a predetermined degree of importance of the root cause entities, a number of edges connecting the root cause entities to other software components of the entity graph, a number of edges connecting an entity in the path to a known security risk, a combination thereof, and the like. Such a degree of risk may serve to demonstrate the urgency needed for responding to the issue to a user being notified of the issue, which may help in determining how to prioritize fixing the issue.

FIG. 9 is an example schematic diagram of a hardware layer of the pipeline manager 100 according to an embodiment. The system 130 includes a processing circuitry 910 coupled to a memory 920, a storage 930, and a network interface 940. In an embodiment, the components of the pipeline manager 100 may be communicatively connected via a bus 950.

The processing circuitry 910 may be realized as one or more hardware logic components and circuits. For example, and without limitation, illustrative types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), graphics processing units (GPUs), tensor processing units (TPUs), general-purpose microprocessors, microcontrollers, digital signal processors (DSPs), and the like, or any other hardware logic components that can perform calculations or other manipulations of information.

The memory 920 may be volatile (e.g., random access memory, etc.), non-volatile (e.g., read only memory, flash memory, etc.), or a combination thereof.

In one configuration, software for implementing one or more embodiments disclosed herein may be stored in the storage 930. In another configuration, the memory 920 is configured to store such software. Software shall be construed broadly to mean any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions, when executed by the processing circuitry 910, cause the processing circuitry 910 to perform the various processes described herein.

The storage 930 may be magnetic storage, optical storage, and the like, and may be realized, for example, as flash memory or other memory technology, compact disk-read only memory (CD-ROM), Digital Versatile Disks (DVDs), or any other medium which can be used to store the desired information.

The network interface 940 allows the pipeline manager 100 to communicate with, for example but not limited to, tenant resources (e.g., resources storing data related to computing infrastructure pipelines), third party infrastructure management tools (e.g., code repositories, ticketing or notification systems, CI/CD managers, identity providers, code scanners, IaC tools, container repositories, automated security tools, cloud provider infrastructures, vulnerability management tools, etc.), both, and the like.

It should be understood that the embodiments described herein are not limited to the specific architecture illustrated in FIG. 9, and other architectures may be equally used without departing from the scope of the disclosed embodiments.

It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed embodiments. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.

The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software may be implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.

All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosed embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.

It should be understood that any reference to an element herein using a designation such as “first,” “second,” and so forth does not generally limit the quantity or order of those elements. Rather, these designations are generally used herein as a convenient method of distinguishing between two or more elements or instances of an element. Thus, a reference to first and second elements does not mean that only two elements may be employed there or that the first element must precede the second element in some manner. Also, unless stated otherwise, a set of elements comprises one or more elements.

As used herein, the phrase “at least one of” followed by a listing of items means that any of the listed items can be utilized individually, or any combination of two or more of the listed items can be utilized. For example, if a system is described as including “at least one of A, B, and C,” the system can include A alone; B alone; C alone; 2A; 2B; 2C; 3A; A and B in combination; B and C in combination; A and C in combination; A, B, and C in combination; 2A and C in combination; A, 3B, and 2C in combination; and the like.

What is claimed is:
 1. A method for automating alert remediation, comprising: extracting a plurality of entity-identifying values from cybersecurity event data included in a plurality of alerts generated for a software infrastructure; generating at least one query based on the plurality of entity-identifying values; querying an entity graph using the at least one query, wherein the entity graph has a plurality of nodes representing respective entities of the plurality of entities, wherein the plurality of entities includes a plurality of software components of the software infrastructure and a plurality of event logic components of cybersecurity event logic deployed with respect to the software infrastructure; identifying at least one path in the entity graph based on the results of the at least one query, wherein each identified path is between one of the plurality of software components and one of the plurality of event logic components; identifying at least one root cause entity based on the identified at least one path; and generating a fix action plan for the plurality of alerts based on the identified at least one root cause entity.
 2. The method of claim 1, wherein the at least one query is generated based further on a plurality of semantic concepts, further comprising: creating a semantic concepts dictionary, wherein the semantic concepts dictionary defines a plurality of semantic concepts describing potential characteristics for the plurality of software components, wherein the plurality of semantic concepts is extracted from the cybersecurity event data using the semantic concepts dictionary.
 3. The method of claim 1, wherein generating the fix action plan further comprises: applying a plurality of fix determination rules based on the identified at least one root cause entity, wherein the plurality of fix determination rules define a plurality of predetermined fixes for a plurality of types of root causes.
 4. The method of claim 3, wherein the plurality of predetermined fixes are defined further with respect to locations within the software infrastructure, wherein applying the plurality of fix determination rules further comprises: identifying a location of each of the at least one root cause entity within the software infrastructure based on the entity graph.
 5. The method of claim 1, wherein the fix action plan includes a plurality of computer-readable instructions, further comprising: executing the plurality of computer-readable instructions in order to implement at least a portion of the fix action plan, wherein the plurality of computer-readable instructions, when executed by a processing circuitry, configure the processing circuitry to perform at least one mitigation action defined in the fix action plan.
 6. The method of claim 1, wherein the fix action plan defines at least one mitigation action, further comprising: integrating with a plurality of native development lifecycle tools deployed with respect to the software infrastructure, wherein the at least one mitigation action utilizes at least one tool of the plurality of native development lifecycle tools.
 7. The method of claim 6, wherein each of the at least one root cause entity has a respective location within the software infrastructure represented in the entity graph, wherein the at least one tool used for the at least one mitigation action is deployed with respect to the location of each of the at least one root cause entity.
 8. The method of claim 1, wherein the entity graph further includes a plurality of owner nodes representing a plurality of respective owners of the plurality of software components, further comprising: generating at least one notification based on the fix action plan; and sending each of the generated at least one notification to a respective owner of the plurality of owners.
 9. The method of claim 1, further comprising: analyzing the plurality of alerts in order to identify a plurality of matches between alerts of the plurality of alerts, wherein the plurality of matches is identified with respect to issues indicated in the plurality of alerts; identifying a plurality of correlations between respective software components of the plurality of software components linked to the plurality of alerts based on the entity graph; and deduplicating the plurality of alerts based on the plurality of matches and the plurality of correlations.
 10. The method of claim 1, further comprising: prioritizing the plurality of alerts by applying at least one alert prioritization rule, wherein the at least one alert prioritization rule defines how to prioritize alerts with respect to a mapping of the entity graph.
 11. The method of claim 1, further comprising: identifying a first plurality of properties in a plurality of original definitions of a plurality of computing infrastructure resources, wherein each original definition is a definition of a respective software component of the plurality of software components; mapping the first plurality of properties to a second plurality of properties of a plurality of universal definition templates in order to determine a matching universal definition template for each original definition, wherein each of the plurality of universal definitions corresponds to a respective type of computing infrastructure resource and is defined in a unified format; and transforming the plurality of original definitions into a plurality of universal definitions using the plurality of universal definition templates.
 12. A non-transitory computer readable medium having stored thereon instructions for causing a processing circuitry to execute a process, the process comprising: extracting a plurality of entity-identifying values from cybersecurity event data included in a plurality of alerts generated for a software infrastructure; generating at least one query based on the plurality of entity-identifying values; querying an entity graph using the at least one query, wherein the entity graph has a plurality of nodes representing respective entities of the plurality of entities, wherein the plurality of entities includes a plurality of software components of the software infrastructure and a plurality of event logic components of cybersecurity event logic deployed with respect to the software infrastructure; identifying at least one path in the entity graph based on the results of the at least one query, wherein each identified path is between one of the plurality of software components and one of the plurality of event logic components; identifying at least one root cause entity based on the identified at least one path; and generating a fix action plan for the plurality of alerts based on the identified at least one root cause entity.
 13. A system for automating alert remediation, comprising: a processing circuitry; and a memory, the memory containing instructions that, when executed by the processing circuitry, configure the system to: extract a plurality of entity-identifying values from cybersecurity event data included in a plurality of alerts generated for a software infrastructure; generate at least one query based on the plurality of entity-identifying values; query an entity graph using the at least one query, wherein the entity graph has a plurality of nodes representing respective entities of the plurality of entities, wherein the plurality of entities includes a plurality of software components of the software infrastructure and a plurality of event logic components of cybersecurity event logic deployed with respect to the software infrastructure; identify at least one path in the entity graph based on the results of the at least one query, wherein each identified path is between one of the plurality of software components and one of the plurality of event logic components; identify at least one root cause entity based on the identified at least one path; and generate a fix action plan for the plurality of alerts based on the identified at least one root cause entity.
 14. The system of claim 13, wherein the at least one query is generated based further on a plurality of semantic concepts, wherein the system is further configured to: create a semantic concepts dictionary, wherein the semantic concepts dictionary defines a plurality of semantic concepts describing potential characteristics for the plurality of software components, wherein the plurality of semantic concepts is extracted from the cybersecurity event data using the semantic concepts dictionary.
 15. The system of claim 13, wherein the system is further configured to: apply a plurality of fix determination rules based on the identified at least one root cause entity, wherein the plurality of fix determination rules define a plurality of predetermined fixes for a plurality of types of root causes.
 16. The system of claim 15, wherein the plurality of predetermined fixes are defined further with respect to locations within the software infrastructure, wherein the system is further configured to: identify a location of each of the at least one root cause entity within the software infrastructure based on the entity graph.
 17. The system of claim 13, wherein the fix action plan includes a plurality of computer-readable instructions, wherein the system is further configured to: execute the plurality of computer-readable instructions in order to implement at least a portion of the fix action plan, wherein the plurality of computer-readable instructions, when executed by a processing circuitry, configure the processing circuitry to perform at least one mitigation action defined in the fix action plan.
 18. The system of claim 13, wherein the fix action plan defines at least one mitigation action, wherein the system is further configured to: integrate with a plurality of native development lifecycle tools deployed with respect to the software infrastructure, wherein the at least one mitigation action utilizes at least one tool of the plurality of native development lifecycle tools.
 19. The system of claim 18, wherein each of the at least one root cause entity has a respective location within the software infrastructure represented in the entity graph, wherein the at least one tool used for the at least one mitigation action is deployed with respect to the location of each of the at least one root cause entity.
 20. The system of claim 13, wherein the entity graph further includes a plurality of owner nodes representing a plurality of respective owners of the plurality of software components, wherein the system is further configured to: generate at least one notification based on the fix action plan; and send each of the generated at least one notification to a respective owner of the plurality of owners.
 21. The system of claim 13, wherein the system is further configured to: analyze the plurality of alerts in order to identify a plurality of matches between alerts of the plurality of alerts, wherein the plurality of matches is identified with respect to issues indicated in the plurality of alerts; identify a plurality of correlations between respective software components of the plurality of software components linked to the plurality of alerts based on the entity graph; and deduplicate the plurality of alerts based on the plurality of matches and the plurality of correlations.
 22. The system of claim 13, wherein the system is further configured to: prioritize the plurality of alerts by applying at least one alert prioritization rule, wherein the at least one alert prioritization rule defines how to prioritize alerts with respect to a mapping of the entity graph.
 23. The system of claim 13, wherein the system is further configured to: identify a first plurality of properties in a plurality of original definitions of a plurality of computing infrastructure resources, wherein each original definition is a definition of a respective software component of the plurality of software components; map the first plurality of properties to a second plurality of properties of a plurality of universal definition templates in order to determine a matching universal definition template for each original definition, wherein each of the plurality of universal definitions corresponds to a respective type of computing infrastructure resource and is defined in a unified format; and transform the plurality of original definitions into a plurality of universal definitions using the plurality of universal definition templates.